1 Recap

So, in the first workshop we introduced you to R and RStudio, and taught you the basic ways in which you can use R. To quickly refresh, we covered:

  • Working in the console vs writing a script
  • Annotating your code
  • Assigning objects
  • Inf, NaN, NA
  • Naming conventions
  • Vectors (numeric, character, logical)
  • Accessing data using []
  • Logical operators (==, !=, >, etc)
  • Functions
  • Writing your own functions
  • Getting help

In this practical we are going to take a bit of a sideways step - learning skills which are parallel to R but not actually to do with R itself. These skills are critical for ensuring you can easily create widely shareable open science projects, and will also be used during your projects later in the year.

1.1 Homework solution

At the end of the last session I set homework to do some basic calculations using R. Here is my code to run those calculations:

## create a vector of random numbers
ran_50 <- runif(100, 0, 50)

##order them from smallest to largest
sort_ran_50 <- sort(ran_50)

##write the function

my_fun <- function(x){
  
  ##subtract log10(x) from x
  y <- x - log10(x)
  
  ##return the new vector
  return(y)
}

##run the function on your random numbers
new_data <- my_fun(sort_ran_50)

##calculate mean, sd, and se:

##calculate mean
mean_dat <- mean(new_data)

##calculate SD
sd_dat <- sd(new_data)

##the function for se
se <- function(x)sd(x)/sqrt(length(x))

se_dat <- se(new_data)

##results

results <- c("mean" = mean_dat, 
             "sd" = sd_dat,
             "se" = se_dat)

Now, on to this week’s work.

2 Github

2.1 What is Github? Actually, what is Git?!

Github is an online repository aimed at making two things easy: (1) sharing/collaborating on code and (2) version control. Github is actually a front end to Git - the underlying code that makes the sharing/collaborating and version control actually work. Github gives us a nice user-friendly front end where we can do all the important stuff Git does without having to learn the Git commands ourselves. Github is widely used to assist in writing code, particularly when there are many different people involved in writing complex scripts which are hard to keep track of. In essence it will save your R folders from looking like this:

2.2 How does it work?

The figure to the right gives you an approximate idea of how Git and Github work. Let’s work from the bottom up.

Github works with what are called repositories. These are folders on your computer which contain code, data, and files called read me files which describe what the repository contains. These files can be organised into sub folders (in the diagram to the right we have two subfolders in Repository 1 called Data and R code). You will end up with lots of these repositories, usually one for each project you are working on.

The code, data, and other files in these repositories are tracked and catalogued by a GUI for Github called Github Desktop. As we said earlier Github is an online repository, and this desktop GUI allows us to sync our desktop repositories with the stored versions of these repositories online.

These repositories are stored under your profile at Github.com - you can see mine here. A repository can be marked as private (only you and invited collaborators can work on it) or public (anyone can download your code and data).

Github is actually a front end for Git, which does all the heavy lifting of tracking the changes to code, data, etc which form the backbone of what Github is useful for.

Git/Github work in much the same way as track-changes does in Microsoft Word (if you have used that). You will make an initial copy of your repository (including any code which is already in there) and you will sync this to Github.com. Then, whenever you make changes to the documents in your repository, the Github desktop GUI will track these changes and then when you ask it to it will upload these changes onto Github.com.

Today you will learn how to set up a Github account and start a repository, how to see what changes have been made to the files in your repository, and how to branch and clone repositories to get access to other people’s code and data.

You will then be adding code and documents to your Github repository as we go through this course, culminating in the submission of your projects through Github.

2.3 Making a Github account

The first thing to do is make a Github account - this will give you access to all the magical powers of Github. And it’s free! Go to the Github homepage and you should see a big box on the right hand side saying “Sign up”. Alternatively try here. Go through the steps and when you have the option select “Individual - Free” in the account type page. When you have verified your account you will be asked if you want to create your first repository - don’t do this, instead either quit your browser window or navigate to https://github.com/.

2.4 Installing Github desktop

Next you’ll need to download and install Github desktop, selecting the correct operating system. Once this is installed you will need to sign into your Github account (through the prompted window). Great, now we can start using Github.

2.5 Making a repository

There are three ways of making a Github repository. The simplest and best (in my opinion) is to create one via Github desktop.

NOTE - Github repositories have some issues when created in Google Drive, OneDrive or Dropbox on your computer. The solution to this is to ensure that all repositories are created on your local computer and are not being synced via any of these services.

To do this navigate to File -> New repository:

Which will bring up the following box:

Here we will tell Github some basic information about our repository. First off we need to name it. I will leave this up to you, but consider that this is going to be the repository you are using for this R course over the next few weeks (i.e. name it something meaningful, not “Repository 1”!). Then add a description - this doesn’t need to be lots of detail, just a few lines on what this repository is going to be used for.

Next choose a place to make this repository. Where this will be depends on how you organise your files on your computer - I have a Work folder inside which I keep folders for each of my projects. Because Github repositories should be kept small (Github recommends staying under roughly 1GB), it sometimes isn’t appropriate to keep all of the files for a project in your repository. So in the example below you might just keep the R code in a Github repository, whilst the rest is kept on your local drive (but is obviously backed up somewhere else!).

So, decide where you are going to place your repository using the “Local Path” option

Select the “Initialize this repository with a README” box - we will go into that more in a minute.

The Git Ignore option tells Github what sort of files are going into this repository. Select R from the drop down list. We do this so that Github doesn’t include any temporary files R creates whilst running code in the upload to our online Github repository.
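To give a feel for what this option produces, the sketch below writes a small ignore file from R and reads it back. It is only an illustration: the three patterns are common R temporary files and are an assumption about the template (Github’s own list may be longer), and the file is written to a temporary folder rather than to your repository.

```r
## patterns for files R and RStudio create while you work
## (assumed examples; the template Github adds may contain more)
ignore_patterns <- c(".Rhistory", ".RData", ".Rproj.user/")

## write them to a throwaway folder, one pattern per line
ignore_file <- file.path(tempdir(), ".gitignore")
writeLines(ignore_patterns, ignore_file)

## read the file back to see what Git would be told to skip
ignored <- readLines(ignore_file)
print(ignored)
```

Any file in the repository whose name matches one of these patterns is left out of commits, so it never reaches Github.com.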

We can ignore the “License” bit - but just know that if you are writing packages or libraries in the future it will be worth looking into these and deciding what sort of license you want to apply to your code (for an explanation of the different types see here).

So for me, the box now looks like this:

Go ahead and click Create Repository!

If you now navigate to the Local Path you specified when you initialised the repository you should see that Github has created a folder on your computer. If you open that folder you will see a README.md file.

2.6 README.md

Let’s briefly cover the README.md file that Github created. If you open this (using a text editor of some description - I use TextEdit on Mac OS) you will see that the README file has initiated with the title and description you put in during the repository initialisation earlier. Mine are:

# Bioinformatics test repository
 An example repository for the Bioinformatics masters course

This README file is (as the name suggests) a file you should read when you are interested in what is in the repository. It gives you a place to give details about the data and methods, links to the publication/s which this repository supports, the author/s of the code, owners of the data, help guides for using the methods, etc etc.

The # denotes a title - one # is the largest title format, ## is the next largest title, and so on.

For now add a second-level title (using ##) and your name to the README.md (ensuring there is at least 1 empty line below your name in the .md file), then save and close the file (we will pick this up again later).

# Bioinformatics test repository
 An example repository for the Bioinformatics masters course

## Author

Chris Clements

2.7 Getting your repository online

So far you have created a repository, but it is not synced to your online Github repository. If you open Github Desktop you will see something which looks like this (it might vary a bit between Windows and Mac):

This is where we will interface between your changes to your files and Github online. There are a number of things we need to know.

  1. this menu shows you what repository you are working on - i.e. what repository the information and options being displayed are going to make changes to.
  2. this tells us which branch of the repository we are working on (more on that below)
  3. this button is the one that publishes the changes we have made on our local repository to the online version of that repository. This is a key thing to understand - this does not happen automatically!
  4. when “Changes” is selected you are viewing changes made in this repository, and the list of files which have been changed will be displayed here. In this case it is telling us that we made changes to the README.md file.
  5. this box visualises what changes we have made. You can see that the README.md file is being displayed, and our changes (adding “Author” and our name) have been added to this file - the green denotes lines of text which have been added, red denotes lines which have been taken away.
  6. this box is for us to give some information on the changes we have made since we last “Committed” to our branch.

So, what is Committing?

2.8 Committing

A “commit” is basically Github’s way of saying “save”. When you make a “commit” you are saying that you want to save the project at this point with any associated changes to the files within the repository. This will become critical later as we start to think about version control - the ability to go back to a previous version of the repository and start again from there. There is a balance to be considered regarding how often you commit - if you commit every small change it is very difficult to find the point you want to go back to, whereas if you don’t commit often enough there may be no saved version close to the point you want to return to.

Along with the commit we add comments to say what we have done since the last time we committed. If we look again at the following:

We can see that number 6 is both where we add these comments and where we make the commit. It’s given us a suggested title for this commit, but let’s write in “Initial commit” to signify this is the first time we have made changes to this repository, and in the description you can make some notes saying that this is the initial commit and you have made changes to the README.md.

Once you have done this then go ahead and click “Commit to master”. Note the bold, this signifies we are committing to the master branch (number 2 in the figure, more on branches below).

You will see that once you have clicked “Commit to master” then all the information displayed disappears and you are left with a sign saying “No local changes”. This is telling us that there have been no changes to any of the files in our repository since we last committed.

You will also see below the “No local changes” that there is a message saying that our repository is only available on our local machine, with a prompt to “Publish your repository to GitHub”. This highlights an important point - a “commit” is local. To make these changes appear on our Github profile online we need to “push” these changes to Github.

2.9 Pushing

“Pushing” is Github speak for making sure our local changes to our repositories are published to our online repository. Because we haven’t pushed any changes before, in this case “pushing” to Github will also publish our repository (once the repository has been published, the button will instead read “Push origin”). So go ahead and click “Publish your repository to GitHub”. You will see a box appear giving you some options:

Here we can change the name of our repository and description, and we have the option to keep this code private. Make sure this is selected - you can make your code public at any time, but once code has been public other people may already have copied it, even if you later make the repository private again.

You can ignore the Organization section too as this isn’t relevant (if you want to know more about them then have a read here).

Once you are happy, click “Publish Repository”.

2.10 Github online

So we have published our Repository. Let’s login to Github online and see what it looks like there.

Once you have logged in navigate to your profile, and then click on “Repositories”. You should now see the repository you just published. If you click on the repository you will then see the following:

There is a list of the files in your repository (not much at the moment), and the README.md is conveniently displayed as a formatted document below.

Some information on the history of your repository is displayed at the top - the number of commits and number of branches (see below) are the ones most relevant to you. There are also some useful buttons it is worth noting at this point - the “Clone or download” button and “New pull request”. More on those later too.

2.11 Branching and version control (explanation)

Branching and version control are not the same thing, but they are related. We mentioned version control earlier on: it’s like track changes in Microsoft Word - you can see the changes which have been made, and you can roll back to a previous version of the document if what you are doing didn’t work, or you need to use a new method or technique. Branching is a way of having multiple parallel versions of a repository which you can work on independently, and then merge the changes back together to form a single repository again. Let’s cover branching first.

2.11.1 Branching

When you make a Github repository you create the Master branch - this is the main working branch of the repository. You can make and commit changes to this master branch as we did above.

However, let’s imagine we want to carry out a statistical analysis. We think we want to use a generalised linear model (GLM) on our data but we aren’t 100% sure. Instead of doing this coding and committing these changes to the Master branch, and then having to undo them at some later date, a better option is to create a new branch. This branch will run in parallel to the master branch and allows us to test out our analysis without being 100% sure it will work.

Take the diagram above. We have our original master branch, and we have branched off from this master (1) and made a commit to that branch (2 - maybe some data tidying and sorting before our analysis). Then we think about our analyses - we aren’t sure if a GLM or a generalised linear mixed-effect model (GLMM) framework is going to be most appropriate. So we make another branch (3) and we try out the GLM approach. We make a commit (4) but realise that this isn’t likely to be the best way to analyse these data. We could delete all the GLM work we have just done and start over again with the GLMM work, but it’s better not to throw away all our GLM work in case we want to come back to it later. Instead we can just revert back to our original branch (at 3), and develop the analysis there using the GLMM method instead. This leaves our GLM branch hanging by itself, but that’s fine - we have a copy of all the code if we ever want to go back to it. Once we are happy that the GLMM approach is the best option and the analysis is finished, we can then merge this code back in with the master branch (5).

2.11.2 Version control

In the above example we have implicitly covered version control - it’s the ability to move back to an earlier version of our repository (from 4 to 3). In that example we moved back to an earlier position on a different branch, but you can do the same thing along a single branch - moving back to an earlier version of the code you are working on, on the same branch. We can roll back to any point at which we have made a “Commit” - so choosing when and how frequently to make commits is really important, as is keeping good notes to allow you to find exactly which version of the code you want to roll back to. So make your commit notes useful!

2.12 Making and using a branch

Making a branch is easy using Github desktop:

Go ahead and make a branch - pick a name (for this I am just going to use test-branch) and hit Create. Nothing much appears to have changed, but if you look at the Current Branch drop down menu you’ll see that you now have two branches: the master and the test-branch:

We are currently in test-branch and you can return to the master branch by simply clicking on it (but don’t, let’s stay in the test-branch for now). On our earlier schematic we are now between 1 and 2:

To see how useful branches can be let’s make some changes to our repository. In your repository create a folder called “R code”, then open RStudio and make a new R script. Then let’s write a simple R script:

## this command clears R's memory.
## it will delete all of the loaded in data, and any objects and code which has been run.
## it's useful to have this at the beginning of all your 
## R projects to make sure anything saved in R's memory isn't affecting your code or results
## (I start almost all my R scripts with this)
rm(list = ls())

##make a vector of random numbers from a normal distribution:
normal_nums <- rnorm(n = 100, mean = 10, sd = 2)

Then save this into our new R code folder. If you go back to Github desktop you will see these changes (addition of the R code) have appeared in your repository. Go ahead and commit those (with notes!) and Publish branch. If you look at your online repository you will now see you have 2 branches:

If you select the test-branch you’ll see your new folder has been synced - and inside it is your R code. Fantastic! We are now at point 2:

So we know we are working on our branch, but what is going on on our master branch? The short answer is nothing. We can view the other branches of our repository at any time from the drop down menu in Github desktop - try it now by selecting the master branch from the Current branch drop down menu. Now, go back to your repository folder on your local computer and you’ll see that everything you had put into the repository on the branch (the R code folder and the code inside it) has disappeared! This is because we are back at point 1 on our schematic. You can switch between any of your branches at any time using this menu - Github stores them all on your computer at the same time, and just allows you to view/edit them when you select them in Github desktop. Smart!

Right, let’s go back to our test-branch and work some more on our R code. Let’s write a simple function, one we wrote last week. Add the following to the R script you are working on in your repository:

##Calculate standard error
se <- function(x){
  ##standard deviation divided by the square root of the number of observations
  std_er<-sd(x)/sqrt(length(x))
  ##return the answer
  return(std_er)
}

Save that and commit it.

Then let’s modify the function above so that it returns both the standard error and the mean, and then try that out to see if it works:

##Calculate standard error
se_mean <- function(x){
  ##standard deviation divided by the square root of the number of observations
  std_er<-sd(x)/sqrt(length(x))
  ##return the answers
  return(c("mean" = mean(x), "se" = std_er))
}

##run it on our vector
se_mean(normal_nums)

If you run your script you should see that that works fine, and you get a vector returned looking like this:

      mean         se 
10.3460555  0.2156303 
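Your numbers will not match these exactly, because rnorm() draws fresh random values every time it is run. A minimal sketch, assuming you want the same values on every run: set.seed() fixes the random number generator, and the seed value 42 is an arbitrary choice that is not part of the original script.

```r
##the function from above, repeated so this example runs on its own
se_mean <- function(x){
  std_er <- sd(x)/sqrt(length(x))
  return(c("mean" = mean(x), "se" = std_er))
}

##fix the random number generator, then draw the numbers
set.seed(42)
first_run <- se_mean(rnorm(n = 100, mean = 10, sd = 2))

##resetting the same seed gives exactly the same draw again
set.seed(42)
second_run <- se_mean(rnorm(n = 100, mean = 10, sd = 2))

##check the two runs match
identical(first_run, second_run)
```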

Great, let’s save that and commit as well. You can push it to the origin too to make sure your Github online repository is at the same stage. We are now at somewhere between points 3 and 5 on our schematic:

2.13 Version control (resetting)

Right, let’s imagine that we tried the se_mean function route but have decided that this isn’t what we want to do, and we want to go back to the original se function instead. We need to do a reset of the file in our repository. Remember that doing this means we still have the se_mean version in our history in case we ever need to go back to it or access some of the code. There are a whole host of ways to do this depending on exactly what you want to do. We will cover 2 quick and simple ones which are fine for this purpose.

2.13.1 The lazy (and easy) way

The easiest way is to navigate to your repository on github and get the version of the file you want, download it, and save it over the file you want to reset in your local repository:

Alternatively, if you just want part of the code you can copy that part and paste it directly into your R file, which can be useful to pull out specific functions or lines of code rather than replacing the whole thing.

2.13.2 The proper way

The other way is to do a reset using the command line. To do this you need to open command line (terminal on mac) and navigate to your repository.

We suggest that you do everything via the Github GUI as it is easier, but if you want, this is the alternative option.

NOTE - the below will not work for Windows OS, and you will need to use different commands. We suggest you use the Github Desktop GUI for simplicity.

However, for completeness, for me to navigate it would be via:

cd ~/Desktop/Work/Bioinformatics-test-repository

where cd means change directory. Once you have navigated to your directory you can look at the files in it using:

ls

Then look at the status of this repository:

git status

Git will then give you some information on the status of the repository:

Then look at the log for the file you want to reset using the git log command and the path to the file. Below I have specified that the file we want to change (Analysis.R) is in the folder R code. If you have a space in your folder name like I do (between the R and the code) you will need to tell the command line that it is a space by putting a \ before the space:

git log R\ code/Analysis.R

This will give you information on the file and what changes it has had:

We then need to find out which version of the file we want to revert to. You can do this through Github desktop (making sure you are on the right branch and right file). To do this navigate to history and find the version you want (this is where your helpful notes and titles will come in useful!) and copy the version number from the top:

Once you have your version number copied (it will look something like a02c0e5) go back to command line/terminal and checkout the version using git checkout, the version number you copied, and the pathway to the file you want to replace again:

git checkout a02c0e5 R\ code/Analysis.R

Run that, and then finally commit the restored file using:

git commit -m "reverted back to prev version"

Done! You will now have the previous version of your file, and you can work on it as normal. You can’t pull out parts of the code as you could with the last option so use this only when you want to reset the code to a specific point in its entirety.

2.14 Merging your changes

So we have talked about working on a branch (the work flow above puts us between points 3 and 5 on the below diagram).

Now, once we are happy with the code we want to merge this back into our Master branch. You can do this via Github desktop. Make sure all of your changes are committed to the test-branch, then select the Master branch from the Current Branch tab, from the bottom of the menu select which branch you want to merge into Master, then go ahead and merge:

This gets more complicated when you have files you want to overwrite/merge together, but we can leave it here for now.

2.15 Cloning or download

So we talked earlier about the other thing Github is great for - sharing code. As you can probably guess there are a lot of ways to do this, but the two we will cover briefly are cloning and downloading repositories.

2.15.1 Downloading a repository

Downloading a repository is the easiest way to access all the data and code in someone else’s public repository. To do this simply navigate to their repository on Github.com (let’s use my repository https://github.com/chrit88/Bioinformatics_data) and click the green Clone or download button on the right hand side. This gives you the option to Download Zip which will download a ZIP file of the entire repository to your computer. Unzip this and you have all the code, data, and whatever else is in the repository.

You can also download a specific file by navigating to that file, selecting the RAW version, and downloading it to your computer (more on that next week):

2.15.2 Cloning a repository

The other way of accessing a whole repository is by cloning it. Cloning means that the repository (in its current form, along with its history) is copied to your computer as a repository that Github Desktop can track. You can then work with it much as if it were one you had made yourself, including committing changes through Github Desktop. To do this, again navigate to the repository you want on Github.com and click the green Clone or download button on the right hand side. Then click Open in Desktop. You can then follow the steps to save this repository onto your computer. Note that pushing changes back to the original repository only works if its owner has given you write access, and that a clone by itself does not add the repository to your own Github.com profile.

2.16 Summary

Great, so now we have a working knowledge of Github and how we can use it for R. Github will be really important for you throughout this course: use it to store the analyses and code you generate doing these workbooks and for the homework, and later you will use it (in combination with R Markdown) to submit your group projects. There are additional guides available here.

3 R Markdown

The other topic we are going to cover today is R Markdown. R Markdown is a typesetting system which allows you to publish HTML, PDF, or MS Word documents which integrate R code and its outputs into the document. It’s (thankfully) very easy to learn and use, and produces very intuitive and great looking documents (all of the work sheets for this course have been produced using R Markdown). There are lots of tips online on how to make great looking RMarkdown documents, so this will serve as a very brief introduction and you will then be able to go away and learn how to improve on the basic RMarkdown format.

3.1 Creating an R Markdown file

RStudio actually helps you to set up and use RMarkdown as you can directly create an RMarkdown file using File -> New file -> R Markdown. Go ahead and make a new RMarkdown file - I typically set the output to HTML as it’s nice and flexible and resizes objects as you resize the window.

You can see that RStudio also provides you with some example code and instructions to help you get started. Read through those now and try the Knit function at the top of the screen.

3.2 Understanding how RMarkdown deals with R code

Perhaps the most useful thing about RMarkdown is its ability to include R code into the document. This code can either be run invisibly in the background or can be run and shown in the document itself. The important thing to realise is there is effectively an R script BEHIND the RMarkdown script, and that using the ```{r} & ``` commands just allows you to access this script that is there in the background:

So, each R chunk is not treated as a separate mini script. Rather, it can use objects made by previous chunks, and the objects it makes can be used by later chunks. This means you can include complex R code in RMarkdown files which can run in the background to produce graphics, simulations, etc.
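A minimal sketch of this, written as plain R with comments marking where each chunk would begin (the object names temps and mean_temp are made up for illustration):

```r
##--- first chunk: create an object ---
temps <- c(12.1, 14.3, 9.8)

##--- a later chunk: temps is still available here ---
mean_temp <- mean(temps)
mean_temp
```

When the document is knitted the chunks are run in order from top to bottom, so an object has to be created in an earlier chunk than the one which uses it.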

3.3 Including or excluding R chunks

You saw in the default RMarkdown text that you can include R chunks into your RMarkdown file. Sometimes you might want to display that code so that the reader can understand what you have done. That is done simply by using the introduced ```{r} & ``` commands.

If you want to run the code silently in the background you can use ```{r, echo = FALSE} & ``` and the code will execute (and any plots or outputs will be displayed) but the code run will not appear.

You can also tell RMarkdown to include code as R code (i.e. it will be formatted like R code) but not to run that code using: ```{r, eval = FALSE} & ```:

3.4 Inline code

Inline “code” can be included by surrounding it with grave accents `. This code won’t be run as R code, but will be formatted to look like code.

3.5 Loading required packages

We haven’t covered packages yet, but they contain functions and data you will use in your R analyses. We will cover them next time, but for now take note that you can load them into RMarkdown in the setup R chunk at the top of the RMarkdown file - below I have loaded in the tidyverse package.

{r setup, include=FALSE}
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)

3.6 Creating clickable menus

RMarkdown handles a lot of things automatically, such as menu creation and section numbering. You can add a table of contents and section numbers using the code below (this is what I used to produce the document you are reading at the moment). Note the indentation is required for the code to work.

---
title: "**The accelerated R Course:** <br> Practical 2 <br> *Github & RMarkdown*"
author: "Chris Clements"

output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true    
    number_sections: true
---

3.7 Including R Plots

We haven’t covered making plots in R (we will do in an upcoming workshop) but the same principle as for our vectors above applies: if the code you are running produces an output, that output will be incorporated into your RMarkdown file (whether that is a plot, or the contents of a vector). Figure captions can be included using the fig.cap argument:
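As a minimal sketch, assuming the figure is drawn from R’s built-in pressure dataset (the caption below suggests so), a chunk opened with a header such as {r, fig.cap = "Figure 1. Effects of temperature on pressure."} could contain:

```r
##pressure is a dataset that comes with R:
##the vapour pressure of mercury at different temperatures
plot(pressure)
```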

Which gives us:

Figure 1. Effects of temperature on pressure.

3.8 Including pictures/gifs

The easiest way of including images is by saving the image into the same folder where you have the RMarkdown script saved, then you can simply tell RMarkdown the name of the image and it will Knit it into the output using:

<center>
![A caption for the image](the image name.gif)
</center>

The <center> tag before and the </center> tag after tell RMarkdown to start a section where the objects and text should be placed in the middle of the page (<center>), and then to stop doing that (</center>).

3.9 Summary

RMarkdown is intuitive and easy to use, and moreover has loads of help online so the above is just a starting point. RMarkdown will be very useful for you to write your project reports and complete your homework, as you can easily display your code and data analysis skills to us and your colleagues!

4 Key Concepts

Ok, to round up what we have learned today:

  • What Github is and why it is useful
  • Setting up a Github account and installing Github Desktop
  • Making a repository
  • Committing and pushing commits
  • Branching
  • Version control
  • RMarkdown - how to insert R code and images

5 Functions covered today

Function/operator                            Call
Clear R’s memory                             rm(list = ls())
Random numbers from a normal distribution    rnorm()

7 Homework

  • Go through and ensure you have understood everything from today’s practical.
  • Set up a github repository for your homework and learning on this Bioinformatics course. You can use this for saving your future homework and code.
  • Clone the repository containing the data for the rest of this course onto your computer: https://github.com/chrit88/Bioinformatics_data
  • Use RMarkdown to summarise the most important bits of what you have learned so far - i.e. make a sort of cheat sheet for yourself of useful functions or insights. This will be different for everyone, so I won’t tell you what to put on it but I do suggest you include a table of useful functions which you have come across during your reading.